But I have been doing data analysis projects in R for 15+ years
And I’ve learned a few things along the way
If you are an expert, chime in anytime!
What will I cover today?
How I try to make my project workflow reproducible, including:
{starter} to create standard project frameworks
Folder structure and naming
RStudio projects and {here} package for portability
Quarto for reproducible reporting
{renv} for reproducible environments
Context
Everything here evolved in the context of the work I do and how I do it.
1. Collaborate with doctors on clinical research projects: they send me data, I analyze it
2. Work independently as the only statistician/programmer for a given project
3. Patient data is sensitive
Points 2 and 3 mean nothing goes on GitHub
Point 2 also means that my interest in reproducibility is for future Emily and for sound science
The {starter} package
“Provides a toolkit for starting new projects”
Using {starter} with default settings
# install.packages("starter")
starter::create_project(
  path = fs::path(tempdir(), "My Project Folder"),
  open = FALSE # don't open project in new RStudio session
)
✔ Writing files "README.md", ".gitignore", "My Project Folder.Rproj", and ".Rprofile"
✔ Initialising Git repo
✔ Initialising renv project
- Lockfile written to "C:/Users/zabore2/AppData/Local/Temp/Rtmp4AN2BX/My Project Folder/renv.lock".
- renv infrastructure has been generated for project "C:/Users/zabore2/AppData/Local/Temp/Rtmp4AN2BX/My Project Folder".
Resulting project structure
Custom {starter} templates
The default is a great start, but I want a bit more:
Shell code files
Include Word template for Quarto
See the {starter} website for details on creating custom templates.
The R script that created my custom template is in my personal R package on GitHub here.
- Lockfile written to "C:/Users/zabore2/AppData/Local/Temp/Rtmp4AN2BX/example-custom-project/renv.lock".
- renv infrastructure has been generated for project "C:/Users/zabore2/AppData/Local/Temp/Rtmp4AN2BX/example-custom-project".
Resulting custom project structure
Structure inside the code folder
Munge file template
Quarto file template
Structure inside the templates folder
See details on how to create your own reference document for Word output here.
Folder structure and naming
Find something that works for you and stick with it.
What I do as a collaborative biostatistician:
Store all project folders on the same drive, backed up by my organization
Give each project a folder
Name the folder as “PIName-brief-project-description”.
For example, a project with Jane Smith about treatment for metastatic breast cancer might be “Smith-metastatic-breast-trt”
Initialize using {starter}
Also add a “data” folder
Project reports produced by Quarto are saved in the main project folder as, e.g., “Smith-metastatic-breast-trt-report-2025-10-18” for version control
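The naming convention above can be sketched in a few lines of base R. The helper name `project_folder_name()` is hypothetical (it is not part of {starter}), and a temporary directory stands in for the backed-up drive. In the real workflow {starter} creates the project folder itself, so only the “data” folder is added by hand.

```r
# Hypothetical helper: build "PIName-brief-project-description"
project_folder_name <- function(pi_last_name, description) {
  # Lowercase the description and turn runs of non-alphanumerics into hyphens
  slug <- gsub("[^a-z0-9]+", "-", tolower(description))
  # Drop any leading or trailing hyphens
  slug <- gsub("^-+|-+$", "", slug)
  paste(pi_last_name, slug, sep = "-")
}

folder <- project_folder_name("Smith", "metastatic breast trt")
folder

# A temporary directory stands in for the organization's backed-up drive
project_dir <- file.path(tempdir(), folder)

# Add the "data" folder inside the project folder
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.exists(file.path(project_dir, "data"))
```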
RStudio projects
Benefits of working inside an RStudio project include:
Starts a fresh R session every time the project is opened
The current working directory is set to the project directory
Previously open R scripts are restored at project startup
Other RStudio settings are restored
Multiple RStudio sessions can be open at one time, running independently in different RStudio projects
Creating RStudio projects
Automatically using the {starter} package
File menu in RStudio
Project menu in RStudio
RStudio project from the file menu
RStudio project from the project menu
Workflow with RStudio project
The {here} package
“Easy file referencing in project-oriented workflows”
What does it do?
Creates paths relative to the top-level directory.
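As a small illustration, assuming the {here} package is installed and the code runs inside a project folder, a path to a data file can be built from its components. The file name is borrowed from the example later in this talk.

```r
# Assumes {here} is installed: install.packages("here")

# With no arguments, here::here() returns the top-level project directory
project_root <- here::here()

# Build a path relative to that directory, with no setwd() and
# no hardcoded absolute path
data_path <- here::here("data", "breastcancer.csv")

basename(data_path)          # the file name
basename(dirname(data_path)) # the folder that contains it
```

Because the path is resolved from the project root, the same line works from the R script in the code folder and from the Quarto file.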
Quarto for reproducible reporting
I started out writing reports in R Markdown, then switched to Quarto.
Switching was very easy, and I still use a lot of R Markdown-style programming in my Quarto files.
Never again:
hardcode a number
have separate documents for text and tables
manually create tables
have difficulty updating results when data change
Easily mix code chunks with text
Report numbers in line in a programmatic way
Separate files for data preparation and data reporting
Recall my starter template created two shell documents:
R script where data are cleaned and coded and saved into .rda
Quarto file where clean data are read in, analyses done, results reported
What do I include?
I write my Quarto reports with three main sections:
Notes/questions: notes on things I did in the data cleaning process that I want to call attention to, e.g., how categories were combined or missing data that need to be addressed
Methods: A formal statistical methods section that can be copied and pasted directly into the eventual scientific publication
Results: Mostly tables and figures with some text interpretation mixed in.
Quarto output options
html: probably the most popular, has many more customization options
pdf: the trickiest to use, in my opinion, requires LaTeX
Word: unpopular, but my preference as it makes it easy to copy and paste entire tables and blocks of text from my report into the publication
Components of a Quarto file
The YAML header
Code chunks
Markdown text
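To make the three components concrete, here is a minimal sketch that writes a tiny Quarto file from R. The file name, title, and the value of `n_patients` are placeholders made up for illustration; this is not my actual template.

```r
# Build the code-chunk fence (three backticks) programmatically
fence <- strrep("`", 3)

qmd_lines <- c(
  # 1. The YAML header: document-level options such as title and output format
  "---",
  "title: \"Example report\"",
  "format: docx",
  "---",
  "",
  # 3. Markdown text: headings and prose
  "## Results",
  "",
  # 2. A code chunk: R code that runs when the file is rendered
  paste0(fence, "{r}"),
  "#| echo: false",
  "n_patients <- 120 # placeholder value, not real data",
  fence,
  "",
  # Markdown text with an inline R expression instead of a hardcoded number
  "A total of `r n_patients` patients were included."
)

qmd_file <- file.path(tempdir(), "example-report.qmd")
writeLines(qmd_lines, qmd_file)

# To render, assuming Quarto and the {quarto} R package are installed:
# quarto::quarto_render(qmd_file)
```

The last line of the document is the part that prevents hardcoded numbers: if the data change, the reported count changes with them on the next render.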
Rendering
Rendering places the output file in the same folder as the .qmd file, in this case the code folder
I always “Save As” to the main project folder with the date of the file creation for version control
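The “Save As” step could also be scripted. A minimal sketch in base R, using a temporary folder and an empty placeholder file in place of a real project and a real rendered report; all file names here are illustrative.

```r
# Temporary stand-in for a project that has a code folder
project_dir <- file.path(tempdir(), "Smith-metastatic-breast-trt")
dir.create(file.path(project_dir, "code"), recursive = TRUE, showWarnings = FALSE)

# Placeholder for the rendered report that Quarto leaves next to the .qmd file
rendered <- file.path(project_dir, "code", "report.docx")
file.create(rendered)

# Name the copy "<project folder>-report-<date>.docx"
dated_name <- paste0(
  basename(project_dir), "-report-", format(Sys.Date(), "%Y-%m-%d"), ".docx"
)

# Copy the report to the main project folder under the dated name
dated_copy <- file.path(project_dir, dated_name)
file.copy(rendered, dated_copy, overwrite = TRUE)
```

Copying rather than moving leaves the rendered file in the code folder, so each dated copy in the main folder is a frozen version of the report as sent.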
The {renv} package
“create reproducible environments for your R projects”
Initialize the project
First run renv::init() to initialize a new library. This was done for us with starter::create_project().
Other {renv} functions
install() to install packages from CRAN, GitHub, or Bioconductor
update() gets the latest versions of all dependencies
For collaboration with others:
snapshot() adds metadata about currently used packages to the lockfile
restore() uses metadata from the lockfile to install exactly the same version of every package
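One way these functions might fit together is sketched below. It assumes {renv} is installed and the project was already initialized; the wrapper names are hypothetical, and nothing touches the library until a wrapper is called.

```r
# Assumes {renv} is installed and renv::init() has already been run
# (starter::create_project() does this). Wrapper names are hypothetical.

# While working: install what the analysis needs, then record versions
record_packages <- function(packages) {
  renv::install(packages) # install into the project library
  renv::snapshot()        # record the packages the project uses in renv.lock
}

# On another machine, or for future me: rebuild the library
restore_packages <- function() {
  renv::restore() # install the versions recorded in renv.lock
}

# Example call, run from inside the project:
# record_packages(c("dplyr", "readr"))
```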
Put it all together
I am starting a new project with Dr. Jane Smith about the association between radiation treatment and overall survival in women with breast cancer. Dr. Smith has emailed me an Excel dataset to analyze for the project.
- Lockfile written to "G:/StatTeam/zabore/Smith-breast-radiation/renv.lock".
- renv infrastructure has been generated for project "G:/StatTeam/zabore/Smith-breast-radiation".
Notes:
“G:/StatTeam/zabore” is my organization’s preferred and backed-up drive on my computer
A new project folder named “Smith-breast-radiation” will be created and populated
Add a data folder and save the data there
The investigator sent me an Excel file, which I save as is
I also “Save As” a csv, which I’ll import to R for data cleaning
Open the RStudio project
Once in the RStudio project:
Open the two shell files (R script and qmd)
Start to install needed packages using renv::install()
Insert comments on speed and other issues
Read in, clean up, and save the data
This is one place where the {here} package will come in handy
library(dplyr)
library(readr)

# Import data ------------------------------------
df0 <- read_csv(
  file = here::here("data", "breastcancer.csv")
) |>
  janitor::clean_names() |>
  janitor::remove_empty()

# Clean data -------------------------------------
df <- df0 |>
  mutate(
    # Insert data cleaning steps here
  ) |>
  labelled::set_variable_labels(
    # Insert variable labels here
  )

# Save the data ----------------------------------
save(
  df,
  file = here::here("data", "smith-breast-rt-data.rda")
)
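On the reporting side, the Quarto shell file would then read the saved .rda file back in. Below is a base R sketch of the save and load round trip, with a toy data frame and a temporary file standing in for the real data and for the here::here() path.

```r
# Toy data frame standing in for the cleaned data (not real patient data)
df <- data.frame(id = 1:3, treated = c(TRUE, FALSE, TRUE))

# In the project this path would come from here::here("data", ...)
rda_file <- file.path(tempdir(), "example-data.rda")
save(df, file = rda_file)

# load() restores objects under their original names; loading into a
# new environment makes it explicit what the file contained
loaded <- new.env()
load(rda_file, envir = loaded)

ls(loaded)               # names of the objects stored in the file
identical(loaded$df, df) # the data frame survives the round trip
```

In a Quarto file a plain `load()` into the global environment is usually enough; the separate environment here is only to show what comes back.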